Comparing manual and automated extraction of chemical entities from documents

Authors

  • Christian Tyrchan
  • Sorel Muresan
Abstract

The chemical information landscape is changing rapidly, with a yearly increase of over 1 million new compounds and more than 700,000 chemistry-related publications [1]. Exploring the chemical space covered by relevant journals and patents is a crucial step in early-stage medicinal chemistry projects. Extracting chemical entities from unstructured text is a complex task, and several approaches are currently used, including manual extraction by expert curators, text mining supported by chemical named entity recognition (NER), or combinations thereof [2]. The chemical information and corresponding annotations are subsequently stored in relational databases, allowing for complex chemical and text queries. To assess the capability of chemical NER in documents and to understand the coverage and accuracy of the underlying data, we compared the chemistry extracted by manual curation (GVKBIO) and by text mining (SureChem) from a small patent corpus.

  • GVKBIO databases are populated with explicit relationships between compounds, assays and sequence identifiers that have been manually extracted from journals and patents on a large scale [3].
  • SureChem Portal [4] is a gateway for chemical patent search on full-text collections for USPTO, EPO and WO. SureChem users can perform structure and keyword searches on more than 9 million unique compounds.

We selected a set of 250 patents covering various target classes, for each of which a minimum of 25 records was retrieved from the GVKBIO Patent database. The analysis was done using PipelinePilot protocols [5]. These initial results demonstrate the benefits and challenges of text mining for chemical information extraction from unstructured text.
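
The per-patent comparison described above can be illustrated with a minimal Python sketch. This is not the authors' PipelinePilot protocol. It assumes that each source has already been reduced to a mapping from patent identifier to a set of normalized structure keys (for example InChIKeys), and that the manually curated set serves as the reference; both are assumptions made for illustration. The function names `compare_patent` and `compare_corpus`, and the use of recall and precision as summary figures, are hypothetical choices that the abstract does not specify.

```python
def compare_patent(manual_keys, mined_keys):
    """Compare two collections of structure keys extracted from one patent.

    Assumption: the manually curated set is treated as the reference.
    """
    manual_keys, mined_keys = set(manual_keys), set(mined_keys)
    shared = manual_keys & mined_keys
    return {
        "shared": len(shared),
        "manual_only": len(manual_keys - mined_keys),
        "mined_only": len(mined_keys - manual_keys),
        # Fraction of manually curated compounds also found by text mining.
        "recall": len(shared) / len(manual_keys) if manual_keys else 0.0,
        # Fraction of text-mined compounds confirmed by manual curation.
        "precision": len(shared) / len(mined_keys) if mined_keys else 0.0,
    }


def compare_corpus(manual, mined, min_records=25):
    """Compare extractions patent by patent.

    manual, mined: dicts mapping patent id -> set of structure keys.
    Patents with fewer than min_records manual entries are skipped,
    mirroring the selection threshold mentioned in the abstract.
    """
    results = {}
    for patent_id, manual_keys in manual.items():
        if len(manual_keys) < min_records:
            continue
        results[patent_id] = compare_patent(manual_keys, mined.get(patent_id, ()))
    return results
```

Called as `compare_corpus(manual, mined)`, the sketch returns one summary dictionary per retained patent. A patent that is absent from the text-mined source is reported with zero shared compounds rather than dropped, so that gaps in coverage stay visible.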

Related resources

Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

BACKGROUND First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of ch...

Comparison of automated and manual neutrophil counting in the diagnosis of spontaneous bacterial peritonitis

Introduction: Spontaneous bacterial peritonitis (SBP) is a prevalent complication in patients with cirrhosis and ascites, which leads to high in-hospital mortality. Diagnosis is made when the ascitic fluid neutrophil count is ≥250 cells/mm³. Manual counting of neutrophils is time-consuming, technically difficult, expensive and in many cases operator-dependent. In contrast, automated counting ...

Fragmentation measurement using image processing

In this research, first of all, the existing problems in fragmentation measurement are reviewed for the sake of its fast and reliable evaluation. Then, the available methods used for evaluation of blast results are mentioned. The produced errors especially in recognizing the rock fragments in computer-aided methods, and also, the importance of determination of their sizes in the image analysis ...

Histogram analysis with automated extraction of brain-tissue region from whole-brain CT images.

We studied whether an automated extraction of the brain-tissue region from CT images is useful for histogram analysis of that region. We used the CT images of 11 patients. We developed an automatic brain-tissue extraction algorithm. We evaluated the similarity index of this automated extraction method relative to manual extraction, and we compared the mean CT numbe...

Automated extraction and characterisation of social network data from unstructured sources - An ontology-based approach

Automated extraction of social network related data is one objective of the applied research project on SNA in Counter‐Insurgency context (SNAC) at DRDC Valcartier. Since the vast majority of the information resides in unstructured text documents, the prototype must be able to extract social network related data directly from them. For these tasks, the prototype leverages and refines exist...

Journal:

Volume 2, Issue

Pages -

Publication date: 2010